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We present a method of parallelizing the stochastic cutoff (SCO) method, which is a Monte-Carlo method for long- 
range interacting systems. After interactions are eliminated by the SCO method, we subdivide a lattice into noninter¬ 
acting interpenetrating sublattices. This subdivision enables us to parallelize the Monte-Carlo calculation in the SCO 
method. Such subdivision is found by numerically solving the vertex coloring of a graph created by the SCO method. 
We use an algorithm proposed by Kuhn and Wattenhofer to solve the vertex coloring by parallel computation. This 
method was applied to a two-dimensional magnetic dipolar system on an L x L square lattice to examine its paralleliza¬ 
tion efficiency. The result showed that, in the case of L = 2304, the speed of computation increased about 102 times by 
parallel computation with 288 processors. 


1. Introduction 

It is widely recognized that the recent trend in computa¬ 
tional physics is parallel computing with a large number of 
computational resources. This recognition is supported by the 
fact that all of the top 100 supercomputers released in Novem¬ 
ber 2014 consist of more than ten thousand cores.'* Eurther- 
more, computation with a graphics processing unit (GPU) has 
been a hot topic in recent years^^'"* because it enables us to 
perform massively parallel computing at a fraction of the cost. 
To fully utilize these parallel architectures, the development 
of efficient parallel algorithms is indispensable. Such parallel 
algorithms are required particularly in long-range interacting 
systems because of their high computational cost. 

One example in which the parallelization of computation 
has been successfully achieved is the molecular dynamic 
(MD) method (see Refs. 11, 12, and references therein). If 
there are only short-range forces, parallel computations are 
performed by dividing the simulation box into cubic domains 
and assigning each domain to each processor. *■'* If the sys¬ 
tem involves long-range forces such as the Coulomb ones, 
long-range forces for all the molecules are efficiently calcu¬ 
lated with 0{N log N) or 0{N) computational time {N is the 
number of molecules) by sophisticated methods such as the 
Barnes-Hut tree algorithm,first multipole method,'^’ 
and particle mesh Ewald method. *^’ Because these methods 
can be parallelized, it is possible to perform parallel compu¬ 
tations in the MD methods even in the presence of long-range 
forces. 

The main reason why parallel computations in the MD 
methods are relatively simple lies in their simultaneous fea¬ 
ture. In the MD methods, forces acting on all the molecules 
are calculated at the beginning of each step, and new posi¬ 
tions of molecules in the next time step are determined simul¬ 
taneously using forces calculated in advance. In contrast, in a 
normal Monte-Carlo (MC) method, elements such as particles 
and spins are moved one at a time. This sequential feature of 
the MC method makes parallelization difficult. If the system 


involves only short-range interactions, it is still possible to 
perform parallel computations in the MC method. Eor exam¬ 
ple, MC simulations in lattice systems can be parallelized by 
a checkerboard decomposition.^'* Even in off-lattice systems, 
parallel computations are still possible by a spatial decompo¬ 
sition method.^^“^^* The spatial-decomposition technique is 
also used in the kinetic (event-driven) MC method in short- 
range interacting systems to parallelize the computation. 
However, such a spatial-decomposition method does not work 
in long-range interacting systems because all the elements in¬ 
teract with each other. Eurthermore, the above-mentioned ef- 
hcient algorithms used in the MD method to calculate long- 
range forces do not work in the MC method because these 
methods calculate long-range forces (and potentials) for all 
the elements at once. In the MC methods, long-range forces 
calculated for all the elements become invalid after a part of 
the elements are updated because of the sequential feature of 
the MC method. Therefore, it is difficult for long-range inter¬ 
acting systems to perform parallel computations in a normal 
(and most widely applicable) single-update MC method. 

To overcome this difficulty, we utilize the stochastic cut¬ 
off (SCO) method.^'*“^^* The SCO method is a Monte-Carlo 
method for long-range interacting lattice systems. The basic 
idea of the method is to switch long-range interactions Vjj 
stochastically to either zero or a pseudointeraction Vij using 
the Stochastic Potential Switching (SPS) algorithm.■'^’^"'* The 
SPS algorithm enables us to switch the potentials with the de¬ 
tailed balance condition strictly satished. Therefore, the SCO 
method does not involve any approximation. Eukui and Todo 
have developed an efficient MC method based on a similar 
strategy using different pseudointeractions and different way 
of switching interactions.^^* Because most of the distant and 
weak interactions are eliminated by being switched to zero, 
the SCO method markedly reduces the number of interactions 
and computational time in long-range interacting systems. Eor 
example, in a two-dimensional magnetic dipolar system, to 
which we will apply our MC method later, the number of po¬ 
tentials per spin and the computational time for a single-spin 
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update are reduced from 0{N) to (9(1). 

This reduction of potentials also makes it possible to sub¬ 
divide the lattice into noninteracting interpenetrating sublat¬ 
tices, i.e., so that the elements on a single sublattice do not 
interact with each other. This subdivision enables us to par¬ 
allelize the computation. However, one problem is that there 
is no trivial way of hnding such a subdivision. In the case of 
Ising models on a square lattice with nearest-neighbouring in¬ 
teractions, the checkerboard decomposition is responsible for 
it. In contrast, there is no trivial subdivision in the this case 
because interactions are stochastically switched. To resolve 
this problem, we numerically solve the vertex coloring on a 
graph created by the potential switching procedure. This com¬ 
putation is performed in a parallel fashion using an algorithm 
proposed by Kuhn and Wattenhofer.^®’^^^ 

The paper is organized as follows; In Sect. 2, we briefly ex¬ 
plain the SCO method and describe the parallel computation 
of the vertex coloring, which is a key feature of the present 
method. In Sect. 3, we show the results obtained by apply¬ 
ing the present method to a two-dimensional magnetic dipolar 
system. In Sect. 4, we give our conclusions. 


2. Methods 

2.7 Stochastic Cutoff (SCO) Method 

In this subsection, we briefly explain the SCO method. We 
consider a system with pairwise long-range interactions de¬ 
scribed by a Hamiltonian 77 = Yii<j yij{Si,Sj), where S, is a 
variable associated with the /-th element of the system. In the 
SCO method, Vij is stochastically switched to either 0 or a 
pseudointeraction Vij as 


Vu(Si,Sj) 


jo prob.:Pij(Si,Sj), 

prob.:l-PijiSi,Sj). 


The probability Pij and the pseudointeraction Vij are given by 
PijiSi, Sj) = cxpmVijiSi, Sj) - y”“)], (2) 


VijiSi, Sj) = VijiS^, Sj) - log[ 1 - PijiSi, S;)], (3) 

where p is the inverse temperature and y™“ is a constant 
equal to (or greater than) the maximum value of Vij over all S, 
and Sj. With this potential switching, the algorithm proceeds 
as follows: 

(A) Switch the potentials Vij to either 0 or Vij with the prob¬ 
ability of Pij or 1 - Pij, respectively. 

(B) Perform a standard MC simulation with the switched 
Hamiltonian 

= (4) 

for «sw MC steps, where Y!ij runs over all the potentials 
switched to Vij and one MC step is defined by one trial 
for each S, to be updated. 

(C) Return to (A). 

In the SCO method, an efficient method is employed to re¬ 
duce the computational time of the potential switching in step 
(A) (see Ref. 30 for details). As a result, the computational 
time in step (A) becomes comparable to that in step (B) per 





Fig. 1. (Color online) Schematic illustration of parallel computations of the 
SCO method. A vertex denotes a variable S,- and an edge denotes a potential 
Vij or Vij. In step (1), each potential is switched to either 0 or Vij. The edges 
whose potentials are switched to 0 are eliminated in the subsequent steps. In 
step (2), the vertex coloring of the graph is solved numerically in a parallel 
fashion. In step (3), variables with a specific color are updated simultaneously 
by a standard MC simulation. This procedure is canied out for all the colors. 


MC step. For example, in the case of a two-dimensional mag¬ 
netic dipolar system, both computational times are reduced to 
0{N). 

2.2 Outline of parallel computations of the SCO method 

Figure 1 shows a schematic illustration of parallel compu¬ 
tations of the SCO method. A vertex denotes a variable S, and 
an edge denotes a potential Vij or Vij. In step (I), each poten¬ 
tial is switched to either 0 or Vij. The edges whose potentials 
are switched to 0 are eliminated in the subsequent steps. Be¬ 
cause each potential is switched independently with the prob¬ 
ability Pij, we can easily parallelize the computation in this 
step. In step (2), the vertex coloring of the graph created in 
step (1) is solved numerically in a parallel fashion. The par¬ 
allel computation of the vertex coloring will be explained in 
detail in the next subsection. Lastly, we perform a standard 
MC simulation with the switched Hamiltonian 77' in step (3). 
It is apparent from the definition of the vertex coloring that 
variables with the same color do not interact with each other. 
Therefore, we can update variables with a specific color by 
parallel computation, as we do in MC simulations of Ising 
models by a checkerboard decomposition. By doing this si¬ 
multaneous update for all the colors, we can parallelize the 
MC calculation in step (3). 

2.3 Parallel computation of the vertex coloring 

In this subsection, we briefly explain parallel computation 
of the vertex coloring. We refer the reader to the book in 
Ref. 37 for more details. By solving the vertex coloring in 
a parallel fashion, we can perform all the three steps men¬ 
tioned in the previous subsection by parallel computation. We 
hereafter call the vertex coloring by parallel computation dis¬ 
tributed graph coloring. 

The organization of this subsection is as follows: In 
Sect. 2.3.1, we explain the basis of the distributed graph col¬ 
oring. In Sect. 2.3.2, we explain a basic color reduction al- 
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Fig. 2. (Color online) A graph and its initial coloring to which we apply 
the BCR algorithm. The number of vertices N and the maximum degree A 
are 8 and 3, respectively. The number of colors a is 6. The vertices recolored 
in step (2) and those recolored in step (2)’ are enclosed by solid circles and 
dashed ones, respectively. 


gorithm for the distributed graph coloring. This algorithm is 
used in an algorithm proposed by Kuhn and Wattenhofer,^^’ 
which is used in this study. This algorithm is explained in 
Sect. 2.3.3. 

2.3.1 Basis of the distributed graph coloring 

We start with the introduction of several technical terms 
in the graph theory. The degree of a vertex is the number of 
edges that connect the vertex with others, and the maximum 
degree is the largest value of the degrees of a graph. In gen¬ 
eral, it is known that a graph with a maximum degree A can 
be colored with A + 1 colors, while, in most cases, it is not the 
smallest number of colors needed to color the graph. The aim 
of the distributed graph coloring is to color a graph with A H-1 
colors by parallel computation. 

In the distributed graph coloring, each vertex is initially 
colored by different colors, i.e., a graph is colored with N col¬ 
ors, where N is the number of vertices. The number of col¬ 
ors is gradually reduced from to A H- 1 by repeating syn¬ 
chronous communication and parallel computation. In syn¬ 
chronous communication, vertices communicate with each 
other to know the colors of their neighbouring vertices. In par¬ 
allel computation, all vertices simultaneously recolor them¬ 
selves. The new color is locally calculated by using the infor¬ 
mation of the neighbouring colors obtained in the preceding 
communication. Vertices do not communicate with each other 
in parallel computation. The number of times of synchronous 
communications required to accomplish a (A H- l)-coloring is 
called running time, hereafter denoted as Tr. The main aim 
of the distributed graph coloring is to reduce fR as much as 
possible. 

2.3.2 Basic color reduction algorithm 

The basic color reduction (BCR) algorithm is one of the 
most fundamental algorithms for the distributed graph col¬ 
oring. Figure 2 shows a graph and its coloring to which we 


apply the BCR algorithm. We assume that the number of 
vertices and the maximum degree of the graph are N and 
A, respectively. The graph is initially colored with a colors 
(A H- 1 < a < N) and the coloring is valid, i.e., no adjacent 
vertices share the same color. The color of a vertex is speci¬ 
fied by an integer between 1 and a. In the BCR algorithm, the 
number of colors is reduced from cr to a - 1 by the following 
steps:^^^ 

(1) Each vertex communicates with each other to obtain the 
colors of the neighbouring vertices. 

(2) Each vertex recolors itself if its color is a. The new color 
is chosen from a palette between 1 and A H- 1 by using 
the information obtained in step (1). 

Steps (1) and (2) correspond to synchronous communication 
and parallel computation in the previous subsection, respec¬ 
tively. We can always choose a new color among A -H 1 col¬ 
ors because the maximum degree of the graph is A. It is also 
important to notice that the vertices with the color a cannot 
be adjacent to each other because the initial coloring is valid 
(see the vertices enclosed by a solid circle in Eig. 2). This 
means that the new coloring is also valid even if each vertex 
with the color a simultaneously changes its color according 
to the information of the neighbouring colors. The BCR algo¬ 
rithm reduces the number of colors one at a time by repeating 
these two steps. Therefore, the running time fR to accomplish 
a (A -H l)-coloring from an initial A^-coloring is - A - 1. 

When we implemented the BCR algorithm in our simula¬ 
tion, we slightly modified the algorithm to improve its effi¬ 
ciency. To be specific, we modified step (2) in the following 
manner: 

(2)’ Each vertex recolors itself if its color is locally maxi¬ 
mum. The new color is chosen from a palette between 1 
and A -H 1 using the information obtained in step (1). 

In Eig. 2, the vertices recolored in steps (2)’ and (2) are en¬ 
closed by dashed circles and solid ones, respectively. We see 
that the former involves the latter. This means that the running 
time is reduced by this modification. We also find in Eig. 2 
that the vertices recolored in step (2)’ are not adjacent to each 
other because they are locally maximum. Therefore, the new 
coloring is also valid by the same reason as before. A demerit 
of this modification lies in the computational cost to check 
whether the color of a vertex is locally maximum. However, 
this demerit was not significant in our simulations because the 
degrees of graphs were not so large. 

2.3.3 KW algorithm 

In this subsection, we explain an algorithm proposed by 
Kuhn and Wattenhofer.^®^ We hereafter call it the KW algo¬ 
rithm. The KW algorithm markedly reduces the running time 
fR by applying the BCR algorithm recursively. As mentioned 
above, we used this algorithm to numerically solve the vertex 
coloring. 

Eigure 3 shows a schematic illustration of the KW algo¬ 
rithm. Eor simplicity, we assume that the number of ver¬ 
tices N and the maximum degree A are related by A = 
(Ah- 1) X 2^, where M is an integer. Generalization to other 
cases is straightforward. We suppose that all the vertices are 
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Fig. 3. (Color online) Schematic illustration of the KW algorithm. The 
number of vertices N is related to the maximum degree A by = (A+ l)x2^ 
with M = 2. After two groups Gi(l) and Gi(2) are made by integrating two 
adjacent groups at level 0, we apply the color reductions and Rii2) 

to the groups Gi(l) and Gi{2), respectively. These two color reductions are 
performed simultaneously. The number of colors is halved from 4(A + 1) to 
2(A + 1) by the two color reductions. We then apply the color reduction i? 2 (l) 
to the group G 2 (l), which is made by integrating the two groups Gi(l) and 
Gi(2). As a result, the number of colors is reduced from 2{A + 1) to A + 1. 



Fig. 4. (Color online) The circular component of the magnetization 
defined by Eq. (7) is plotted as a function oiT jj. The data for single-thread 
computation with 1 processor and those for multiple-thread computation with 
8 processors are denoted by squares and circles, respectively. The size L is 
128. The average is taken over 10 different runs with different initial condi¬ 
tions and random sequences. 


initially colored by different colors. The color is specified 
by an integer between 1 and N. The KW algorithm starts 
by partitioning all vertices into V/(A +1) = 2^ groups ac¬ 
cording to their colors. We hereafter denote them by G^ik) 
{k - 1 , 2 , • ■ ■ , 2 “), where the subscript “0” represents the 
level of the grouping. In this partitioning, vertices whose color 
is between l-i-(A:-l)(A+l) and k{A+l) are assigned to the k- 
th group Go(k). We next make groups at level 1 by integrating 
two adjacent groups at level 0 (see Fig. 3). We denote them by 
Gi(k) (k - 1,2, • ■ ■ , 2^“'). Just after the integration, 2(A -H 1) 
vertices in the group Gi{k) are colored by 2(A -i-1) colors. We 
then apply the BCR algorithm to reduce the number of col¬ 
ors from 2(A -i- 1) to A -H 1. We denote this color reduction 
applied to vertices in the group Gi{k) by Ri{k). We can simul¬ 
taneously perform all the color reductions at level 1 because 
there is no overlap of colors among the groups. In the color re¬ 
duction Riik), the color of vertices is changed so that the new 
color is between I + (k - 1)(A + 1) and k{A + 1). This guar¬ 
antees that there is no overlap of colors among groups at level 
1 even after the color reductions. As a result of all the color 
reductions at level 1, the number of colors used for coloring 
the whole graph is reduced from N to N/2. By repeating the 
integration of two adjacent groups and the subsequent parallel 
color reduction M times, the number of colors is reduced to 
A-h 1. 

We next consider the running time of the KW algorithm. At 
a level p, there are 2^^’’ groups. In each group, the number 
of colors is reduced from 2(A -t- 1) to A -i- 1 by the BCR algo¬ 
rithm. Now, the point is that we can simultaneously perform 
BCRs in all the 2^^’’ groups. To be specific, we simultane¬ 
ously perform BCRs in all the groups to reduce the number of 
colors by one, and perform synchronous communication just 
once for the next color reductions. By repeating this proce¬ 
dure A -I- 1 times, we can reduce the numbers of colors of all 
the groups from 2(A-i-l)toA-i-l. The running time to achieve 
this color reduction at the level p is A -H 1. Because the number 
of levels is M, the total running time to reduce the number of 


colors from A to A H- 1 is estimated to be 

fR = (A+l)xM = (A+l)log2(^), (5) 

where we have used the relation N = (A + 1)2''^. If N » 
A, this running time is much shorter than that of the BCR 
algorithm, which is, as mentioned above, on the order of N. 


3. Results 

3.1 Model 

To investigate the efficiency of parallel computation of the 
method developed in this study, we apply the method to a two- 
dimensional magnetic dipolar system on an L x L square lat¬ 
tice with open boundaries. The Hamiltonian of the system is 
described as 


(ij) 


KJ 


Si-Sj ^{Srnj)(Sj-nj) 


u 


, ( 6 ) 


where 5,- is a classical Heisenberg spin of |S,j = 1, {ij} 
runs over all the nearest-neighbouring pairs, r/y is the vec¬ 
tor spanned from a site i to j in the unit of the lattice con¬ 
stant a, and rij - |r,j|. The first term describes short-range 
ferromagnetic exchange interactions and the second term de¬ 
scribes long-range dipolar interactions, where J{> 0) is an ex¬ 
change constant and D{> 0) is a constant that represents the 
strength of magnetic dipolar interactions. We hereafter con¬ 
sider the case that DjJ - 0.1. We choose this model because 
it was used as a benchmark of the SCO method.It is es¬ 
tablished that the model undergoes a phase transition from a 
paramagnetic state to a circularly ordered state at ^ 
as a consequence of the cooperation of exchange and dipolar 
interactions.^®^ We applied the SCO method only for mag¬ 
netic dipolar interactions. The system was gradually cooled 
from an initial temperature T - 1.257 to 0.057 in steps of 
AT - 0.057. The initial temperature was set to be well above 
the critical temperature. We set Usw defined in Sect. 2.1 to 
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Table I. Temperature dependences of the mean degree {k} and the maxi¬ 
mum degree (A). Graphs are created by the potential switching in the SCO 
method. The size L is 2304. The average is taken over 35 graphs. 


temperature 


<A> 

1.257 

1.40 

8.54 

0.457 

3.36 

12.9 

0.057 

22.7 

41.6 



123456789 10 


r 

Fig. 5. (Color online) The probability Psurvive that a potential survives by 
being switched to V is plotted as a function of r, where r is the distance 
between two interacting sites. The size L is 2304. The temperatures are 0.057 
(squares), 0.457 (circles), and 1.257 (triangles), respectively. 


be 100, i.e., potential switching and subsequent vertex color¬ 
ing are performed every 100 MC steps. It was determined in 
Ref. 30 that this frequency of potential switching is sufficient 
for this model to obtain reliable results. 

To check the correctness of our parallel computation, we 
performed MC simulation and measured the absolute value 
of the circular component of magnetization defined by 



where [■ ■ • denotes the z-component of a vector, {■ ■ ■) de¬ 
notes thermal average, and is a vector describing the cen¬ 
ter of the lattice. In this measurement, the system was kept 
at each temperature for 100,000 MC steps. The first 50,000 
MC steps are for equilibration and the following 50,000 MC 
steps are for measurement. We performed simulations for 10 
different runs with different initial conditions and random se¬ 
quences. The result is shown in Fig. 4. The squares and circles 
denote the result of single-thread computation with 1 proces¬ 
sor and that of multiple-thread computation with 8 processors, 
respectively. Both data coincide with each other within statis¬ 
tical error. We also see that rapidly increases around the 
critical temperature Tc ss 0.887. From these results, we con¬ 
clude that our parallel computation is performed correctly. 
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Fig. 6. (Color online) The proportion Pipe of surviving potentials that re¬ 
quire interprocessor communication is plotted as a function of temperature. 
The size L is 2304. The numbers of processors Aproc’s are 48 (triangles), 144 
(circles), and 288 (squares), respectively. 
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3.2 Properties of graphs and improvements to reduce com¬ 
munication traffic 

In Table I, we show the mean degree (k) and the maximum 
degree (A) of graphs at three temperatures. These are impor¬ 
tant quantities because the maximum degree determines the 
number of colors and the mean degree (k) is proportional to 
the computational time per MC step. The size L is 2304. As 
found in Ref. 30, these quantities hardly depend on the size in 
two-dimensional magnetic dipolar systems if the size is suffi¬ 
ciently large. As expected from Eq. (2), both {k} and (A) in¬ 
crease with decreasing temperature. However, they are several 
tens at most. This means that most of the interactions are cut 
off by the potential switching. It should be noted that both (k) 
and (A) are N- 1 a; 5 x 10® before potentials are switched. Fig¬ 
ure 5 shows the distance dependence of the probability Psurvive 
that a potential survives by being switched to V. The temper¬ 
atures are the same as those in Table I. We see that the proba¬ 
bility increases with decreasing temperature. The probability 
is close to one when T - 0.057 and r - 1. However, Psurvive 
rapidly decreases with increasing r at any temperature. 

Taking these properties of the SCO method into consider¬ 
ation, we implemented our simulation in the following way: 
We first divide a lattice into Aproc square cells (Aproc is the 
number of processors) and assign each cell to each processor. 
We then list the vertices whose information should be sent 
by interprocessor communication when we update spins with 
a certain color. For example, a vertex i is added to a list for 
red-spin update if it satisfies the following two conditions: 

• The vertex i is connected with a red vertex j. 

• The two vertices i and j belong to different cells. 

This list is made for each color just once when a new graph 
is created by the potential switching. When we update spins 
of a certain color, we perform interprocessor communication 
in advance according to the list. Although it requires some 
computational cost to make the lists, they enable us to reduce 
the communication traffic before parallel MC calculation as 
much as possible. Figure 6 shows the temperature dependence 
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Fig. 7. (Color online) The average computational time 4ve in the strong 
scaling is plotted as a function of the number of processors Npmc for L = 2304 
(squares) and L = 1152 (circles). 4ve is defined by Eq. (8). The average is 
taken over the temperatures between 0.05/ and 1.25/. 



Fig. 8. (Color online) The ratio fave(iVproc)/fave(l) in the weak scaling is 
plotted as a function of Aproc, where 4ve(A) is the average computational time 
defined by Eq. (8) when the number of processors is X. The numbers of sites 
per processor A'^^’s are 50^ (triangles), 100^ (squares), and 300^ (circles), 
respectively. The average is taken over the temperatures between 0.05/ and 
1.25/. 


of the proportion P\pc of surviving potentials that require in¬ 
terprocessor communication. The proportion increases with 
increasing number of processors. Note that the mesh size de¬ 
creases as the number of processors increases. The proportion 
also increases with decreasing temperature. However, it is less 
than 20% in most cases, meaning that communication traffic 
is considerably reduced by this improvement. 

3.3 Efficiency of parallel computation 

In Fig. 7, we plot the average computational time per MC 
step 4ve as a function of the number of processors. The aver¬ 
age time fave is defined by 

4ve ~ switch + y^^fcolor + ^MC? (8) 

where fswitch, Icoior, and t^c are the computational times to 
switch potentials, to solve vertex coloring, and to perform MC 


simulation for one MC step, respectively. Recall that potential 
switching and the subsequent vertex coloring are performed 
every 100 MC steps. The average is taken over the temper¬ 
atures between 0.05 J and 1.257. The data for L - 2304 and 
those for L - 1152 are denoted by squares and circles, respec¬ 
tively. The data correspond to the strong scaling because Nproc 
is increased with the system size L fixed. When L is 2304, the 
computations with 144 and 288 processors are about 85 and 
102 times faster than that with one processor, respectively. In 
the case of L = 1152, the speedups by 144 and 288 processors 
are about 58 and 57, respectively. 

Figure 8 shows the data of the weak scaling. In the weak 
scaling, we increase both the system size and Nproc with the 
number of sites per processor fixed. In the figure, the ra¬ 
tio fave(Nproc)/4ve(l) is plotted as a function of Nproc, where 
fave(2f) is the average computational time when the number 
of processors is X. We see that the ratio decreases with in¬ 
creasing This means that the parallelization efficiency is 
improved as the system size increases. 

We next consider fluctuations in the number of sites as¬ 
signed to each processor in the MC calculation. As mentioned 
in Sect. 3.2, when we update spins with some color in the MC 
calculation, each processor updates spins in the assigned cell. 
The number of sites with the color is different from cell to cell. 
Because these fluctuations cause the difference in the compu¬ 
tational time among processors, it may significantly decrease 
the parallelization efficiency. To evaluate the effect of fluctua¬ 
tions, we measured the following quantity; 

AR = (9) 

^ave 

where /Tmax and are the maximum and average values 
of the number of sites with a certain color, respectively. We 
measured this quantity because the computational time is de¬ 
termined by the maximum number of sites among processors. 
Figure 9 shows the Aproc dependence of AR. The average is 
taken over the colorings and the colors of each coloring. The 
temperature T is 0.057. Because the number of colors in¬ 
creases with decreasing temperature, this temperature corre¬ 
sponds to the worst case. The data for L - 1152 and those 
for L - 2304 are denoted by circles and squares, respectively. 
When N = 1152 and Aproc = 288, AR is about 22%. This 
means that the fluctuations decrease the parallelization effi¬ 
ciency to some extent. However, we also see that the fluctua¬ 
tions decrease with increasing system size. 

In Fig. 10, fswitch/100, fcoior/100, and fMC are plotted as 
functions of the number of processors. The size L is 2304. 
The sum of the three computational times is equal to fave for 
L - 2304 shown in Fig. 7 (see Eq. (8)). The computation time 
of MC simulation fMc is dominant owing to the factor 1/100 
in (switch and fcoior- We see that each computational time satu¬ 
rates as Aproc increases. Although there are several causes of 
the saturation, such as the fluctuations in the number of sites 
discussed in the previous paragraph, the main reason for the 
saturation is an increase in communication traffic. In the MC 
calculation, we reduced communication traffic by the method 
described in Sect. 3.2. Therefore, the saturation of (mc is grad¬ 
ual. In contrast, the proportion of communication traffic is 
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Fig. 9. (Color online) AR defined by Eq. (9) is plotted as a function of Aproc 
for L = 1152 (circles) and L = 2304 (squares). The average is taken over the 
colorings and the colors of each coloring. The number of colorings is 10. The 
temperature T is 0.057. 



N 

^’proc 

Fig. 10. (Color online) /mc (squai'es), /color/ 100 (triangles), and /switch/ 100 
(circles) are plotted as functions of the number of processors Aproc- The size 
L is 2304. The average is taken over the temperatures between 0.057 and 
1.257. 


large in the potential switching and coloring because the im¬ 
provement mentioned in Sect. 3.2 is not applicable to them. 
To make the present method effective even for larger parallel 
computations, we need to improve the parallelization efficien¬ 
cies of the two processes. 

4. Conclusions 

In this study, we have developed a method of paralleliz¬ 
ing the SCO method, which is a MC method for long-range 
interacting systems. To parallelize the MC calculation in the 
SCO method, we numerically solve the vertex coloring of a 
graph created by the SCO method. This computation is per¬ 
formed in parallel using the KW algorithm.^®’We applied 
this method to a two-dimensional magnetic dipolar system on 
an LxL square lattice to examine its parallelization efficiency. 
The result showed that, in the case of L = 2304, the speed of 
computation increased about 102 times by parallel computa¬ 
tion with 288 processors. 
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